This analysis explores which chemical properties of red wine affect its quality.

Univariate Plots Section

As shown below, the dataset has 1599 rows and 12 columns. The output also confirms that the dataset is relatively tidy.

## [1] 1599
## [1] 12
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1594           6.8            0.620        0.08            1.9     0.068
## 1595           6.2            0.600        0.08            2.0     0.090
## 1596           5.9            0.550        0.10            2.2     0.062
## 1597           6.3            0.510        0.13            2.3     0.076
## 1598           5.9            0.645        0.12            2.0     0.075
## 1599           6.0            0.310        0.47            3.6     0.067
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1594                  28                   38 0.99651 3.42      0.82
## 1595                  32                   44 0.99490 3.45      0.58
## 1596                  39                   51 0.99512 3.52      0.76
## 1597                  29                   40 0.99574 3.42      0.75
## 1598                  32                   44 0.99547 3.57      0.71
## 1599                  18                   42 0.99549 3.39      0.66
##      alcohol quality
## 1594     9.5       6
## 1595    10.5       5
## 1596    11.2       6
## 1597    11.0       6
## 1598    10.2       5
## 1599    11.0       6
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
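
The output above is the kind of result produced by R's basic inspection functions. The following is a minimal sketch of those calls, not the report's original code. It assumes a data frame named `pf` (the name that appears in the model calls later in this report) and rebuilds only a few columns of the first six rows printed above, because the name of the source file is not given here.

```r
# A few columns of the first six rows, copied from the head() output above.
# In the real analysis pf would hold all 1599 rows and 12 columns.
pf <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

nrow(pf)     # number of observations
ncol(pf)     # number of variables
head(pf, 3)  # first rows
tail(pf, 3)  # last rows
str(pf)      # storage type of each column

# One simple tidiness check: count the missing values in each column
missing_per_column <- colSums(is.na(pf))
print(missing_per_column)
```

On the full dataset the same calls would report 1599 rows and 12 columns.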

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The quality scores appear to be approximately normally distributed. Note that they are discrete integer values between 3 and 8.

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed Acidity values appear to be slightly skewed to the right. Using a log scale normalises this data.

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile Acidity values appear to be slightly skewed to the right. Using a log scale normalises this data.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric Acid values appear to be skewed to the right. Using a log scale normalises this data. However, the log scale is problematic here because the values lie between 0 and 1 and include zeros (the minimum is 0.000), for which the logarithm is undefined.
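
The difficulty with a log scale for citric acid can be seen on the six citric acid values printed at the top of this report. The sketch below is illustrative only. The `log1p` shift is one common workaround and is an assumption here, not something the analysis itself uses.

```r
# Citric acid values from the first six rows of the dataset
citric.acid <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)

# Zeros have no finite logarithm, so they fall off a log-scaled axis
log_values <- log10(citric.acid)
print(log_values)
n_dropped <- sum(!is.finite(log_values))
print(n_dropped)

# Assumed workaround (not from the report): log(1 + x) keeps zeros finite
shifted <- log1p(citric.acid)
print(shifted)
```

A shift of this kind changes the shape of the transformed distribution, so it is a trade-off rather than a free fix.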

Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual Sugar values appear to be skewed to the right. Using a log scale does not normalise the data. This suggests that there are some extreme values for residual sugar: the maximum is 15.5, while the third quartile is only 2.6.
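
One rough way to judge how extreme the largest residual sugar value is uses only the summary statistics printed above. The 1.5 x IQR fence in this sketch is a common convention that I am assuming for illustration; it is not part of the original analysis.

```r
# Summary statistics for residual sugar, copied from the output above
q1        <- 1.9
q3        <- 2.6
max_value <- 15.5

iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr  # conventional outlier fence (assumption)

print(upper_fence)
print(max_value > upper_fence)   # is the maximum beyond the fence?
print((max_value - q3) / iqr)    # distance above Q3, in IQR units
```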

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chloride values appear to be skewed to the right. Using a log scale normalises the data.

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free Sulfur Dioxide values appear to be skewed to the right. Using a log scale somewhat normalises this data. On the log scale the distribution appears to be a mixture of two normal distributions.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total Sulfur Dioxide values appear to be skewed to the right. Using a log scale somewhat normalises this data.

Density

Density values appear to be normally distributed.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH values appear to be normally distributed and it seems no transformation is required.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates values appear to be skewed to the right. Using a log scale normalises this data.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol values appear to be skewed to the right. Using a log scale does not normalise the data.
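
The right skew noted for most variables can be cross-checked from the summaries alone, since a mean above the median is a rough sign of a longer right tail. The sketch below copies the medians and means printed above. The `relative_gap` column is a crude indicator that I am introducing for illustration; it is not a formal skewness statistic.

```r
# Medians and means copied from the summary() output above.
# Density is left out because no reliable summary is available for it here.
summaries <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14, 38, 3.31, 0.62, 10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 3.311,
             0.6581, 10.42)
)

# Gap between mean and median, relative to the median (hypothetical indicator)
summaries$relative_gap <- (summaries$mean - summaries$median) / summaries$median

# Largest gap first
summaries <- summaries[order(-summaries$relative_gap), ]
print(summaries, row.names = FALSE)
```

Every mean exceeds its median, with pH showing almost no gap, which agrees with the observation that pH looks normal while most other variables are skewed to the right.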

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wine samples in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). All the variables in this dataset are numeric. Quality is the only integer-valued variable; it is a score that can also be used as an ordered factor.

Other observations:

The median quality score is 6.

Quality scores only range between 3 and 8.

Many of the variables are measurements that must be greater than 0. This, combined with the fact that the dataset is relatively small, has caused many of the variables to be skewed to the right. Therefore there isn’t enough justification to use a log scale.

What is/are the main feature(s) of interest in your dataset?

The main feature of this dataset is the quality scores of the wines. I would like to determine which combination of chemical properties can be attributed towards these quality scores.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I believe the main features affecting quality should be the acidity level, the density and the alcohol content. Other factors may also be of influence. For example, the levels of fixed acidity and volatile acidity may also affect the pH and therefore have an impact. Similarly, residual sugar may have an impact on alcohol content, which in turn may affect the quality. Since the quality score is a subjective rating given by a judge, it is hard to say which features are the most important at this stage of the analysis.

Did you create any new variables from existing variables in the dataset?

No. I modify variables later once I am able to determine the important variables affecting the quality score (after completing bivariate analysis).

Of the features you investigated, were there any unusual distributions?

No. I perform changes later once I am able to determine the important variables affecting the quality score (after completing bivariate analysis).

Bivariate Plots Section

Bivariate Correlation Analysis

The analysis above was done to find any significant bivariate relationships. It tries to capture linear relationships using a correlation coefficient. Coefficients close to 1 indicate a strong direct relationship between two variables, and coefficients close to -1 indicate a strong inverse relationship.
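
As a reminder of what a correlation coefficient measures, the sketch below computes Pearson's r both from its definition and with base R's `cor()`. It uses only three columns of the six rows printed at the top of this report, so the numbers it produces will not match the coefficients quoted below, which come from all 1599 rows.

```r
# Three columns of the first six rows of the dataset
pf <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Pearson's r: covariance scaled by the two standard deviations
r_manual  <- with(pf, cov(alcohol, quality) / (sd(alcohol) * sd(quality)))
r_builtin <- cor(pf$alcohol, pf$quality, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))

# All pairwise coefficients at once
cor_matrix <- round(cor(pf), 3)
print(cor_matrix)
```

Because the coefficient only captures linear association, a weak value does not rule out a non-linear relationship.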

It seems that the quality score is affected by the following factors:

  1. Volatile Acidity (Correlation -0.391)

  2. Citric acid (Correlation 0.226)

  3. Sulphates (Correlation 0.251)

  4. Alcohol (Correlation 0.476)

Of these I think alcohol and volatile acidity are the most significant.

The following other relationships were also observed:

  1. Volatile acidity and fixed acidity (Correlation -0.256)

  2. Fixed acidity and citric acid (Correlation 0.672)

  3. Volatile acidity and citric acid (Correlation -0.552)

  4. Density and fixed acidity (Correlation 0.668)

  5. Density and citric acid (Correlation 0.365)

  6. pH and fixed acidity (Correlation -0.683)

  7. pH and volatile acidity (Correlation 0.235)

  8. pH and citric acid (Correlation -0.542)

  9. pH and density (Correlation -0.342)

  10. sulphates and volatile acidity (Correlation -0.261)

  11. sulphates and citric acid (Correlation 0.313)

  12. alcohol and density (Correlation -0.496)

  13. alcohol and pH (Correlation 0.206)

Of these I think the following relationships are the most important:

  1. fixed acidity and density
  2. fixed acidity and pH
  3. fixed acidity and citric acid

Quality vs Volatile Acidity

As shown above there appears to be a negative relationship between quality and volatile acidity. However it does not seem that this relationship is linear.

Quality vs Citric Acid

It seems that the relationship between citric acid and quality is almost non-existent. The downward trend at the end is caused by a single extreme value.

Quality vs Sulphates

There does not seem to be any relationship between quality and sulphates.

Quality vs Alcohol

Alcohol seems to improve quality up to a certain point (13% alcohol). However as shown this relationship is not linear.

Fixed acidity vs Density

There is some evidence of a linear relationship between density and fixed acidity.

Fixed acidity vs pH

There is some evidence of an inverse linear relationship between pH and fixed acidity.

Fixed acidity vs Citric Acid

There is some evidence of a linear relationship between citric acid and fixed acidity. The downtrend at the end is misleading as it is the result of only one extreme value.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

It seems that the quality score is affected by the following factors:

  1. Volatile Acidity (Correlation -0.391)

  2. Citric acid (Correlation 0.226)

  3. Sulphates (Correlation 0.251)

  4. Alcohol (Correlation 0.476)

Of these I think alcohol and volatile acidity are the most significant.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The following relationships were observed:

  1. Volatile acidity and fixed acidity (Correlation -0.256)

  2. Fixed acidity and citric acid (Correlation 0.672)

  3. Volatile acidity and citric acid (Correlation -0.552)

  4. Density and fixed acidity (Correlation 0.668)

  5. Density and citric acid (Correlation 0.365)

  6. pH and fixed acidity (Correlation -0.683)

  7. pH and volatile acidity (Correlation 0.235)

  8. pH and citric acid (Correlation -0.542)

  9. pH and density (Correlation -0.342)

  10. sulphates and volatile acidity (Correlation -0.261)

  11. sulphates and citric acid (Correlation 0.313)

  12. alcohol and density (Correlation -0.496)

  13. alcohol and pH (Correlation 0.206)

Of these I think the following relationships are the most important:

  1. fixed acidity and density
  2. fixed acidity and pH
  3. fixed acidity and citric acid

What was the strongest relationship you found?

The strongest relationship was the negative one between pH and fixed acidity (Correlation -0.683).

Multivariate Plots Section

Quality, volatile acidity and alcohol

Since alcohol and volatile acidity were the most significant factors affecting the quality score, I decided to analyse this relationship further. In this case I decided to use the rainbow ROYGBIV (Red, Orange, Yellow, Green, Blue, Indigo, Violet) colour scale for alcohol. This is because we know that the optimum score for quality does not occur at extremely high or low values of alcohol but rather at values in between. Using a ROYGBIV scale allows us to discern these levels more easily than if we used a single sequential colour scale.

As shown in the plot and model above, there is definitely some evidence of a relationship between alcohol, volatile acidity and quality. We can explore this relationship further by discretising alcohol and volatile acidity.

Quality, volatile acidity and alcohol - A discretised Analysis

The graph above discretises alcohol and volatile acidity so that we can better see what is happening. However, as shown, it can be hard to discern the colours when they overlap each other because of the jitter. One way to overcome this problem is to use facets to separate the colours and see which of them occurs most often in a given region. The following plot shows how we can do this.

As shown in the plot above, there is some evidence that both alcohol and volatile acidity affect quality scores.
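
The exact discretisation rule is not shown in this report. The sketch below assumes one plausible version: alcohol rounded to whole percentage points, volatile acidity rounded to one decimal place, and both stored as ordered factors. The variable names `alcohol.level` and `volatile.acidity.level` are taken from the regression output later in the report; the rounding rule itself is my assumption.

```r
# Values from the first six rows of the dataset
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66)

# Assumed rule: round, then treat the rounded values as ordered categories
alcohol.level          <- ordered(round(alcohol))
volatile.acidity.level <- ordered(round(volatile.acidity, 1))

# How many wines fall into each level
print(table(alcohol.level))
print(table(volatile.acidity.level))
```

Under this assumption the full alcohol range of 8.4 to 14.9 would round to the eight levels 8 through 15.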

Quality, volatile acidity and alcohol - Regression Models

Given the relationship shown above, I decided to model quality as a function of volatile acidity and alcohol. I tested a number of models, including some with additional variables and some with logarithmic transformations. However, I found that most models had R-squared values similar to that of the simple model shown below. I chose this simple model because simpler models are less prone to overfitting and tend to generalise better than complex ones.

## 
## Call:
## lm(formula = I(quality) ~ I(alcohol) + I(volatile.acidity), data = pf)
## 
## Coefficients:
##         (Intercept)           I(alcohol)  I(volatile.acidity)  
##              3.0955               0.3138              -1.3836
## 
## Call:
## lm(formula = I(quality) ~ I(alcohol) + I(volatile.acidity), data = pf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59342 -0.40416 -0.07426  0.46539  2.25809 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.09547    0.18450   16.78   <2e-16 ***
## I(alcohol)           0.31381    0.01601   19.60   <2e-16 ***
## I(volatile.acidity) -1.38364    0.09527  -14.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared:  0.317,  Adjusted R-squared:  0.3161 
## F-statistic: 370.4 on 2 and 1596 DF,  p-value: < 2.2e-16
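
To make the fitted coefficients concrete, the sketch below applies them by hand to the first wine in the dataset (alcohol 9.4, volatile acidity 0.70, observed quality 5). The coefficients are copied from the output above; the helper `predict_quality` is a hypothetical name used only for this illustration.

```r
# Coefficients as reported in the lm() output above
b0        <- 3.0955
b_alcohol <- 0.3138
b_va      <- -1.3836

# Hypothetical helper: linear prediction from the reported coefficients
predict_quality <- function(alcohol, volatile.acidity) {
  b0 + b_alcohol * alcohol + b_va * volatile.acidity
}

# First wine in the dataset
p1 <- predict_quality(alcohol = 9.4, volatile.acidity = 0.70)
print(p1)

# Same wine with one more percentage point of alcohol
p2 <- predict_quality(alcohol = 10.4, volatile.acidity = 0.70)
print(p2 - p1)  # the difference equals the alcohol coefficient
```

The prediction lands close to the observed score of 5, but the residual standard error of 0.6678 shows that individual predictions can still be off by a noticeable margin.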

Based on previous analysis I decided to test a discretised version of this model. The results are shown below.

## 
## Call:
## lm(formula = I(quality) ~ I(alcohol.level) + I(volatile.acidity.level), 
##     data = pf)
## 
## Coefficients:
##                  (Intercept)            I(alcohol.level).L  
##                      5.17117                       0.90354  
##           I(alcohol.level).Q            I(alcohol.level).C  
##                     -1.39222                      -0.68555  
##           I(alcohol.level)^4            I(alcohol.level)^5  
##                     -0.59854                      -0.12606  
##           I(alcohol.level)^6            I(alcohol.level)^7  
##                     -0.22484                      -0.05995  
##  I(volatile.acidity.level).L   I(volatile.acidity.level).Q  
##                     -2.98340                      -0.63900  
##  I(volatile.acidity.level).C   I(volatile.acidity.level)^4  
##                     -0.56883                      -0.31472  
##  I(volatile.acidity.level)^5   I(volatile.acidity.level)^6  
##                     -0.59662                      -0.58337  
##  I(volatile.acidity.level)^7   I(volatile.acidity.level)^8  
##                     -0.64377                      -0.19910  
##  I(volatile.acidity.level)^9  I(volatile.acidity.level)^10  
##                     -0.15420                       0.09751  
## I(volatile.acidity.level)^11  I(volatile.acidity.level)^12  
##                      0.06453                       0.15364  
## I(volatile.acidity.level)^13  
##                      0.11735
## 
## Call:
## lm(formula = I(quality) ~ I(alcohol.level) + I(volatile.acidity.level), 
##     data = pf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6134 -0.3603 -0.1082  0.4384  2.1345 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.17117    0.12493  41.393  < 2e-16 ***
## I(alcohol.level).L            0.90354    0.42219   2.140 0.032498 *  
## I(alcohol.level).Q           -1.39222    0.41669  -3.341 0.000854 ***
## I(alcohol.level).C           -0.68555    0.33872  -2.024 0.043143 *  
## I(alcohol.level)^4           -0.59854    0.23951  -2.499 0.012553 *  
## I(alcohol.level)^5           -0.12606    0.15286  -0.825 0.409666    
## I(alcohol.level)^6           -0.22484    0.09275  -2.424 0.015461 *  
## I(alcohol.level)^7           -0.05995    0.05513  -1.087 0.276984    
## I(volatile.acidity.level).L  -2.98340    0.40220  -7.418 1.94e-13 ***
## I(volatile.acidity.level).Q  -0.63900    0.39641  -1.612 0.107170    
## I(volatile.acidity.level).C  -0.56883    0.38064  -1.494 0.135268    
## I(volatile.acidity.level)^4  -0.31472    0.35490  -0.887 0.375323    
## I(volatile.acidity.level)^5  -0.59662    0.31584  -1.889 0.059074 .  
## I(volatile.acidity.level)^6  -0.58337    0.29118  -2.003 0.045299 *  
## I(volatile.acidity.level)^7  -0.64377    0.26817  -2.401 0.016482 *  
## I(volatile.acidity.level)^8  -0.19910    0.23728  -0.839 0.401550    
## I(volatile.acidity.level)^9  -0.15420    0.22305  -0.691 0.489448    
## I(volatile.acidity.level)^10  0.09751    0.20844   0.468 0.639998    
## I(volatile.acidity.level)^11  0.06453    0.16828   0.383 0.701436    
## I(volatile.acidity.level)^12  0.15364    0.11844   1.297 0.194755    
## I(volatile.acidity.level)^13  0.11735    0.07833   1.498 0.134270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6651 on 1578 degrees of freedom
## Multiple R-squared:  0.3302, Adjusted R-squared:  0.3217 
## F-statistic:  38.9 on 20 and 1578 DF,  p-value: < 2.2e-16

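As shown, this model produces a slightly better R-squared (0.3302 versus 0.3170; adjusted 0.3217 versus 0.3161), although it needs 20 coefficients rather than 2 to do so. Intuitively this makes sense, as there are likely to be small rounding errors in data measurement. Using rounded values is likely to overcome this problem.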
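
Two details of the discretised model can be checked with a few lines of base R. First, terms such as `.L`, `.Q` and `.C` are the names R gives to the polynomial contrasts of an ordered factor. Second, the adjusted R-squared values can be reproduced from the plain R-squared, the number of observations and the number of predictors. The helper `adjusted_r2` is a hypothetical name for this sketch; the inputs are copied from the two model summaries above.

```r
# Contrast names for an ordered factor with 8 levels
# (matches the seven alcohol.level terms in the output above)
print(colnames(contr.poly(8)))

# Adjusted R-squared penalises each additional predictor
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

simple      <- adjusted_r2(0.3170, n = 1599, p = 2)   # continuous model
discretised <- adjusted_r2(0.3302, n = 1599, p = 20)  # discretised model
print(round(c(simple = simple, discretised = discretised), 4))
```

After the penalty for the 18 extra coefficients, the advantage of the discretised model shrinks to roughly half a percentage point of explained variance.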

Fixed Acidity, Density, pH & Citric Acid

My previous analysis indicates that there is a relationship between fixed acidity, density, pH and citric acid. I decided to explore this relationship further. As it is difficult to explore the relationship between 4 variables at once, I decided to look at 3 variables at a time. First I explored the relationship between fixed acidity, density and citric acid. The results are shown in the plot below.

As shown fixed acidity increases when either density or citric acid values increase.

Similarly, a relationship between fixed acidity, density and pH can be seen in the plot below.

As shown, fixed acidity increases when density increases but decreases when pH increases.

Having seen that all 4 of these variables are related, I decided to combine them in one visualisation so that I could better understand their relationship. In this case I was able to achieve this by discretising values of citric acid.

As shown in the graph above there is a clear relationship between citric acid, density, fixed acidity and pH levels. This is further confirmed in the regression model below.

## 
## Call:
## lm(formula = I(fixed.acidity) ~ I(citric.acid) + I(density) + 
##     I(pH), data = pf)
## 
## Coefficients:
##    (Intercept)  I(citric.acid)      I(density)           I(pH)  
##       -371.809           2.844         394.252          -4.111
## 
## Call:
## lm(formula = I(fixed.acidity) ~ I(citric.acid) + I(density) + 
##     I(pH), data = pf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6657 -0.5179  0.0037  0.5350  4.8048 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -371.8088    12.7339  -29.20   <2e-16 ***
## I(citric.acid)    2.8441     0.1372   20.73   <2e-16 ***
## I(density)      394.2516    12.6660   31.13   <2e-16 ***
## I(pH)            -4.1108     0.1715  -23.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8745 on 1595 degrees of freedom
## Multiple R-squared:  0.7482, Adjusted R-squared:  0.7477 
## F-statistic:  1580 on 3 and 1595 DF,  p-value: < 2.2e-16
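
As a rough sanity check of this model, the sketch below applies the reported coefficients to the six rows printed at the top of this report and compares the predictions with the observed fixed acidity. It is an illustration on six rows only, not a validation of the model.

```r
# Relevant columns of the first six rows of the dataset
head_rows <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  density       = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Coefficients as reported in the lm() output above
coefs <- c(intercept = -371.8088, citric.acid = 2.8441,
           density = 394.2516, pH = -4.1108)

# Design matrix: a column of ones for the intercept, then the predictors
X <- cbind(1, as.matrix(head_rows[, c("citric.acid", "density", "pH")]))

head_rows$predicted <- as.vector(X %*% coefs)
head_rows$residual  <- head_rows$fixed.acidity - head_rows$predicted
print(round(head_rows, 2))
```

The large density coefficient reflects the fact that density only varies in its third and fourth decimal places, so a very small change in density corresponds to a visible change in predicted fixed acidity.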

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

As shown in the visualisation above, there may be an ideal range of alcohol and volatile acidity levels that maximises quality. However, this combination is not a simple linear relationship. It seems that the optimal alcohol level is between 13 and 14 and the optimum volatile acidity level is between 0.3 and 0.8. The low R-squared value suggests that other factors are missing from this investigation.

Were there any interesting or surprising interactions between features?

As shown in the graph above there is a clear relationship between citric acid, density, fixed acidity and pH levels. From this relationship we can see that citric acid (or components that help make it) is an influential ingredient that affects density, fixed acidity and pH levels of red wine.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

The model that I finally decided to use expresses quality as a function of discretised alcohol and volatile acidity levels. I chose it because it was simple and had a reasonable R-squared value (over 0.3). While other factors did improve the R-squared score, their impact was negligible. As explained above, this model is relatively limited. Much of the variation in quality is left unexplained, and it is likely that we are missing important information that could help us predict quality.


Final Plots and Summary

Plot One

Description One

This plot summarises the relationship between quality scores, alcohol, and volatile acidity. I chose to use this plot as it clearly shows that there is an optimum range for alcohol and volatile acidity that maximises quality scores. As shown in the plot, this relationship is not purely linear and more information is required to predict the quality score more accurately.

Plot Two

Description Two

This plot shows the relationship between fixed acidity, density, citric acid and pH. It is important as it clearly shows that fixed acidity increases when either density increases or citric acid increases but decreases as pH increases.

Plot Three

Description Three

This plot tries to combine the information shown in the two plots in plot 2 so that we can see the relationship between fixed acidity, density, pH and citric acid levels. It does this by discretising citric acid levels so that we can use facets to see how pH and density affect fixed acidity. We can also see how fixed acidity increases when we move to higher discretised levels of citric acid.


Reflection

In this exploration the main objective was to explore the relationship between quality scores and the chemical properties of red wine. Our initial analysis showed that we had a limited range of data, which caused many of the individual measurements to be skewed to the right. The bivariate analysis suggested that a relationship existed between quality scores, volatile acidity and alcohol. Further analysis allowed us to plot this relationship and fit a linear regression. It seems that a simple or logarithmic regression is not sufficient, and that further analysis using step functions is required. More data and time are required to verify this. There is some indication that a step-based relationship could be used, as there are ideal ranges of alcohol and volatile acidity where quality scores are maximised.

In addition, a relationship was also found between fixed acidity, density, pH and citric acid. This relationship had a much better R-squared value when regression was performed and was confirmed by several plots. This suggests that citric acid (or some components of it) has a significant impact on density, pH and fixed acidity.